An important task for Homeland Security is the prediction of threat vulnerabilities, such as through the
detection of relationships between seemingly disjoint entities. A structure used for this task is a semantic graph,
also known as a relational data graph or an attributed relational graph. These graphs encode relationships as 
typed links between a pair of typed nodes. Indeed, semantic graphs are very similar to semantic networks used 
in AI. The node and link types are related through an ontology graph (also known as a schema). Furthermore,
each node has a set of attributes associated with it (e.g., "age" may be an attribute of a node of type "person"). 
Unfortunately, the selection of types and attributes for both nodes and links depends on human expertise and is 
somewhat subjective and even arbitrary. This subjectiveness introduces biases into any algorithm that operates 
on semantic graphs. Here, we raise some knowledge representation issues for semantic graphs and provide 
some possible solutions using recently developed ideas in the field of complex networks. In particular, we use 
the concept of transitivity to evaluate the relevance of individual links in the semantic graph for detecting rela- 
tionships. We also propose new statistical measures for semantic graphs and illustrate these semantic measures 
on graphs constructed from movies and terrorism data. 



I. INTRODUCTION 

A semantic graph is a network of heterogeneous nodes and 
links. In contrast to the usual mathematical description of a 
graph, semantic graphs have different types of nodes, and in 
general, different types of links. They are also called attributed
relational graphs [6] and relational data graphs (the latter term
is used in the knowledge discovery literature). The power of these
graphs lies not only in their structure but also in the semantic
information that resides on their nodes and links. Examples
of semantic graphs include citation networks where the nodes 
do not simply consist of papers, but also consist of authors, 
institutions, journals, and conferences. Another example is 
the Internet Movie Database where the nodes may be persons 
(actors, directors, etc.), movies, studios, and awards, among 
others. In Homeland Security, these graphs are used in a variety
of information analysis tasks. In particular,
such graphs may be used for predicting threat vulnerabilities. 

Data for semantic graphs come from relations parsed from 
text documents and/or data from relational databases. Our 
motivation for this work comes from our experience in con- 
structing semantic graphs from two sources of data — movies 
data and terrorism data — to be discussed at the end of this pa- 
per. In both these cases, we were faced with a wide variety of 
choices: what are the node types, what are the link types, and 
how do these choices affect the algorithms that we intend to 
use on these graphs? 

Several types of algorithms operating on semantic graphs 
are of interest to us. For example, to determine the nature of 
a possible relationship between two entities, a subgraph consisting
of the shortest paths (or of paths chosen by another metric) between two
nodes in the semantic graph may be constructed and examined
[10]. We refer to this process as relationship detection.
Fast algorithms based on heuristic search (which improve on 
breadth-first search or bi-directional search) are available for 
this task; some use the semantic information in the graph and
others do not. These algorithms, however, depend on
knowing which links (or link types) in the semantic graph are 
useful for detecting relationships. For example, two people 
who share a connection to "San Francisco" because they were 
born there are unlikely to have any real-life connection. One 
of the goals of this paper is to present automatic algorithms for 
determining which are useful links for relationship detection, 
as well as present concepts to help answer related questions. 

In the past few years, a new field called complex networks
(see, e.g., Albert & Barabási (2002) and Newman (2003)) has
emerged to study the structure of real-world networks. Statis- 
tical tools for characterizing graphs and networks have been 
developed, with the impetus of understanding the relationship 
between the structure and function of networks. Computer 
techniques have allowed these statistical measurements to be 
performed on very large real-world networks. In this paper we 
generalize some of these techniques in order to apply them to 
semantic graphs. For example, some types of nodes in se- 
mantic graphs can be connected to many other types of nodes, 
but generally have few actual links. We quantify this concept 
and hypothesize that nodes such as these are not useful for 
relationship detection. In addition, the concept of transitivity 
in social network analysis (called clustering coefficient in the 
complex networks literature) is useful for determining which 
are useful links for relationship detection. 

In the following, we begin by describing semantic graphs 
and ontologies. We then use the concept of transitivity for 
evaluating links and link types for relationship detection. An 
FIG. 1: A small ontology consisting of three node types.



important aspect of this paper is a presentation of new statis- 
tical measures for semantic graphs, as well as issues related 
to the scale (level of detail) of semantic graphs. Examples of 
semantic graphs for movies and terrorism data are given near 
the end of the paper. 



II. SEMANTIC GRAPHS AND ONTOLOGIES 

A semantic graph consists of nodes and directed links, with 
each node having a type (e.g., movie). The set of types is usually
small compared to the number of nodes. Each node is
also labeled with one or more attributes that identify the specific
node (e.g., Shrek) or give additional information about
that node (e.g., gross income). Links may also have types; for
example, the (person -> movie) link may be of type "acted-in"
or "directed." (In this case, multigraphs, or graphs that
may have multiple links between the same pair of nodes, are 
possible.) In some semantic graphs, the meaning of a link 
between any two nodes is clear (although different between 
different pairs of node types), and no link types need to be defined.
Finally, links may also have attributes. For additional
details, see Sowa (1984).

Depending on the types of nodes and links and on the avail- 
able information, certain relations can or cannot exist. The set 
of relations that can exist in a given semantic graph can be described
by an auxiliary graph called an ontology, or a schema
[11]. More often, an ontology graph is created first by defining
the types of relations that the semantic graph will encode.
A small example of an ontology is given in Figure 1, showing
three node types: person, meeting, and city.

Special links in an ontology graph could describe is-a and 
part-of relationships among node types. This is a node type 
hierarchy that will be briefly mentioned when we discuss the 
scale of semantic graphs. 



III. TRANSITIVITY FOR EVALUATING NODES AND 
EDGES 

Consider a node "San Francisco" of type "city" in a seman- 
tic graph, and suppose we have a database of people which 
includes city of birth among the data fields. A node "Alice" 
of type "person" may be linked to the node "San Francisco" if 



Alice was born in San Francisco. Other nodes linked to node 
San Francisco imply a relationship to San Francisco and in 
turn their relation to Alice. However, it is not clear that such 
relationships give useful information about Alice since most 
entities a short graph distance away from "Alice" will have no 
real-life connection to Alice. 

On the other hand, people born in a city such as "Tikrit," 
may have a much higher likelihood of knowing each other, 
that is, it may be important in this case to be able to asso- 
ciate two people through their city of birth. Instead of using 
a human with potential biases to evaluate nodes and links, an 
automatic procedure is desirable for objectively determining 
which nodes and links should be used in the semantic graph 
for relationship detection. 

Another example is nodes of type "date." Dates could represent
birthdates, dates of meetings, etc. For example, a node
for a person born on 9-11-2001 may be linked to a node labeled
"9-11-2001." However, the fact that two events share a date rarely
implies that the two events are related. Our bias is to treat dates
as attributes of nodes, rather than as nodes of their own (with the
type "date"). Topologically, a "date" node may be connected
to many other types of nodes, but generally each date node is 
connected to only a small number of other nodes. This may be 
an unbiased indication that a date is not useful for relationship 
detection. 



A. The transitivity concept 

The concept of link transitivity is useful to address some of 
the above issues. If a node i has a link to node j and node 
j has a link to node k, then a measure of transitivity in the 
network is the probability that node i has a link to node k. 
In social networks and many other networks categorized as 
small-world networks, this probability is high. This is natural 
in social networks because a friend of a friend is also a friend
with a probability that is much higher than in a random network. In
general, we refer to j as a neighbor of i if i and j are directly 
connected in a graph. Also, we refer to the degree of a node 
as the number of neighbors it has. 

The concept of transitivity is quantified as follows. The 
clustering coefficient of a node, denoted by C(i), is a measure 
of the connectedness between the neighbors of the node. Let 
$k_i$ denote the degree of node $i$, and let $E_i$ denote the number
of links between the $k_i$ neighbors. Then, for an undirected
graph, the quantity [19]

$$C(i) = \frac{2E_i}{k_i(k_i - 1)} \qquad (1)$$

is the ratio of the number of links between a node's neighbors
to the number of links that can exist. We define $C(i)$ to be
0 when $k_i$ is 0 or 1. When $C(i)$ is averaged over all nodes
in the graph, we have the clustering coefficient for a graph. 
Note that high average clustering coefficient does not imply 
the existence of clusters or communities (subgraphs that are 
internally more highly connected than externally) in the graph. 
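
As a concrete illustration, the node and graph clustering coefficients described above (links between a node's neighbors divided by the number of neighbor pairs) can be computed as in the following minimal Python sketch. The adjacency-dictionary layout, the function names, and the toy graph are assumptions made for illustration only; they are not taken from the original analysis.

```python
from itertools import combinations


def clustering_coefficient(adj, i):
    """Clustering coefficient of node i in an undirected graph.

    adj maps each node to the set of its neighbors (hypothetical layout).
    Returns 0.0 when the node has fewer than two neighbors.
    """
    neighbors = adj[i] - {i}
    k = len(neighbors)
    if k < 2:
        return 0.0
    # E_i: number of links that exist between pairs of neighbors of i.
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Average of the node clustering coefficient over all nodes."""
    return sum(clustering_coefficient(adj, i) for i in adj) / len(adj)


# Toy undirected graph (illustrative data only).
adj = {
    "Alice": {"Bob", "Carol", "Dave"},
    "Bob": {"Alice", "Carol"},
    "Carol": {"Alice", "Bob"},
    "Dave": {"Alice"},
}
print(clustering_coefficient(adj, "Alice"))
print(average_clustering(adj))
```

In this toy graph Alice has three neighbors with a single link among them, so her coefficient is 1/3, while Dave has degree 1 and therefore a coefficient of 0 by convention.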



B. Relevance of a node 




We consider the problem of determining whether a node in 
a semantic graph (e.g., "San Francisco" in a previous exam- 
ple) is useful for relationship detection. Consider a node i 
which has links to many other nodes. For now, we assume the 
links are of all the same type. To evaluate whether or not i is 
useful for relationship detection, we examine whether or not 
the neighbors of i are actually related in the semantic graph 
with high frequency. Whether or not two neighbors are re- 
lated is decided by whether or not a link exists between the 
two neighbors. (A weaker condition if this does not hold is 
whether the two neighbors are linked via a third node which 
is already deemed a useful node for relationship detection.) 
This leads to the use of the clustering coefficient defined in 
Equation (1) to measure the relevance of a node i with degree
greater than 1. The equation can be generalized so that Ei 
counts links with the weaker condition described above. A 
threshold r is needed and if C(i) > r then i is a useful node. 
If i is not a useful node, none of the links involving i should be
used for relationship detection, and they could be removed from the
semantic graph. If these links are removed, i could be made 
an attribute of the nodes that i originally linked to, in order 
not to lose any information. 

The above can be generalized for semantic graphs when i 
is linked via many different types of links. In this case, the
count of relationships involving pairs of neighbors
of i is replaced by a matrix $M(t_1, t_2)$. Here $M(t_1, t_2)$ counts
the number of relationships between pairs of neighbors (a, b),
where a is linked to i via type $t_1$ and b is linked to i via type
$t_2$. Small entries in this matrix give pairs of link types
(associated with i) that should not be traversed in relationship
detection.
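
The link-type matrix described above can be sketched as follows. This is only an illustration under stated assumptions: links are given as hypothetical (node, node, link type) triples and treated as undirected, two neighbors count as "related" when a direct link exists between them, and the symmetric matrix is stored as a dictionary keyed by sorted pairs of link types. The function name and the example data are invented for the sketch.

```python
from collections import defaultdict
from itertools import combinations


def link_type_pair_counts(typed_links, i):
    """Count related neighbor pairs of node i, grouped by link-type pair.

    typed_links: iterable of (u, v, link_type) triples, treated as undirected.
    Returns {(t1, t2): count} with t1 <= t2; missing keys mean a count of 0.
    """
    via = defaultdict(set)   # neighbor of i -> link types connecting it to i
    linked = set()           # all linked node pairs, regardless of type
    for u, v, t in typed_links:
        linked.add(frozenset((u, v)))
        if u == i and v != i:
            via[v].add(t)
        elif v == i and u != i:
            via[u].add(t)
    counts = defaultdict(int)
    for a, b in combinations(sorted(via), 2):
        if frozenset((a, b)) in linked:      # a and b are directly related
            for ta in via[a]:
                for tb in via[b]:
                    counts[tuple(sorted((ta, tb)))] += 1
    return dict(counts)


# Illustrative data only.
typed_links = [
    ("Alice", "San Francisco", "born-in"),
    ("Bob", "San Francisco", "born-in"),
    ("Carol", "San Francisco", "works-in"),
    ("Dave", "San Francisco", "works-in"),
    ("Carol", "Dave", "knows"),
    ("Alice", "Carol", "knows"),
]
M = link_type_pair_counts(typed_links, "San Francisco")
print(M)
```

Here the ("born-in", "born-in") entry is absent (zero), which under the argument above would suggest that paths entering and leaving "San Francisco" through two "born-in" links are poor candidates for traversal.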




FIG. 2: A particular ontology for which neighbors of $\alpha$ of type $\delta$ can
never be connected to neighbors of type $\beta$ or $\gamma$.



C. Relevance of a link

The relevance of an existing or potential relationship between
two nodes a and b can be evaluated by how many neighbors
they have in common. More precisely, a relevance measure
may be defined as

$$S(a,b) = \frac{|N(a,b)|}{|T(a,b)|} \qquad (2)$$

where

$$N(a,b) = \{\, w \mid w \text{ is linked to } a \text{ and } b,\ w \neq a,\ w \neq b \,\}$$

and

$$T(a,b) = \{\, w \mid w \text{ is linked to } a \text{ or } b,\ w \neq a,\ w \neq b \,\}$$

with $|T(a,b)| = \deg(a) + \deg(b) - |N(a,b)|$, where $\deg(a)$
is the degree of a. We have $0 \le S(a,b) \le 1$, with large values
of this relevance measure indicating a strong relationship
between a and b supported by a high proportion of common
neighbors. This quantity is similar to the clustering coefficient
and can be generalized to involve neighbors w farther from a
and b.

There are many applications of this relevance measure. For
example, pairs of nodes with no existing link can be evaluated
to check if a latent link might exist. In another example,
the relevance measure can be computed for all links of a
given type. A low average of this relevance measure indicates
that the given link type is not useful for relationship detection;
there is not a strong relation between nodes incident on a
link with the given type. A high relevance measure for a link
when the average relevance measure for the link type is low
(and vice versa) indicates an outlier that may be interesting to
investigate. This relevance measure must be used carefully,
however, since it relies on other links that are assumed to confer
bona fide relationships.

It must also be recognized that a low relevance measure for
an individual link does not imply that the link is unimportant.
On the contrary, the notion of the "strength of weak ties"
suggests that these links are critical in some sense. It is when
almost all links of the same type have a low relevance measure
(and this link type is not of the kind "a secretly knows b") that
this link type should not be used in relationship detection.



D. Generalization of clustering coefficient for semantic graphs

The clustering coefficient defined earlier has little meaning
for semantic graphs as it mixes different types of nodes and
it does not include the constraints imposed by the ontology.
To illustrate this, consider the ontology for a semantic graph
given by Figure 2. In this case, a node of type $\alpha$ can be connected
to types $\beta$, $\gamma$, and $\delta$, but a neighbor of type $\delta$ can never
be connected to neighbors of type $\beta$ or $\gamma$. In order to avoid
unrealistically small values of the clustering coefficient we thus
have to divide by the number of links actually allowed by the
ontology and obtain



$$C(i;\alpha) = \frac{E_i}{E(i;\alpha)} \qquad (3)$$

where $E(i;\alpha)$ denotes the maximum number of links between the
neighbors of node i (of type $\alpha$) allowed by the ontology.
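
The link relevance of Equation (2) and the ontology-constrained clustering coefficient of Equation (3) can be sketched together in Python as follows. The data layout is an assumption: `adj` maps nodes to neighbor sets, `node_type` maps nodes to type names, and `allowed` is a hypothetical set of unordered type pairs that the ontology permits to be linked. The link relevance is computed directly from the set definitions of N(a, b) and T(a, b), so a and b themselves are excluded from both sets. The type names and the toy graph are illustrative and only loosely modeled on Figure 2.

```python
from itertools import combinations


def link_relevance(adj, a, b):
    """S(a, b): shared neighbors of a and b over all their neighbors."""
    n_a = adj[a] - {a, b}
    n_b = adj[b] - {a, b}
    union = n_a | n_b
    if not union:
        return 0.0
    return len(n_a & n_b) / len(union)


def ontology_clustering(adj, node_type, allowed, i):
    """Clustering coefficient of i counting only ontology-allowed pairs.

    allowed: set of frozensets of type names that may be linked (assumed
    layout). For a graph that obeys the ontology, the numerator equals the
    number of links between the neighbors of i.
    """
    actual = 0
    possible = 0
    for u, v in combinations(adj[i] - {i}, 2):
        if frozenset((node_type[u], node_type[v])) in allowed:
            possible += 1
            if v in adj[u]:
                actual += 1
    return actual / possible if possible else 0.0


# Toy graph: one alpha node with beta, gamma, and two delta neighbors.
node_type = {"a": "alpha", "b1": "beta", "g1": "gamma",
             "d1": "delta", "d2": "delta"}
adj = {
    "a": {"b1", "g1", "d1", "d2"},
    "b1": {"a", "g1"},
    "g1": {"a", "b1"},
    "d1": {"a"},
    "d2": {"a"},
}
# Hypothetical ontology: delta may only link to alpha.
allowed = {
    frozenset(("alpha", "beta")),
    frozenset(("alpha", "gamma")),
    frozenset(("alpha", "delta")),
    frozenset(("beta", "gamma")),
}
print(ontology_clustering(adj, node_type, allowed, "a"))
print(link_relevance(adj, "a", "b1"))
```

For node "a" the untyped coefficient would be 1/6 (one link among six neighbor pairs), whereas only the beta-gamma pair is permitted by this hypothetical ontology and it is linked, so the constrained coefficient is 1.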



IV. STATISTICAL MEASURES FOR SEMANTIC GRAPHS 

Along with clustering coefficient, two other relevant graph 
properties that have been developed for standard (non- 
semantic) graphs are distributions of node degree (number of 
neighbors of a node) and average path length between any 



two nodes in the graph. Together, these three graph properties 
can be useful for studying the properties of a semantic graph 
for representing knowledge. 

Many real-world networks have high clustering coefficient, 
much higher than the $O(1/n)$ value for random graphs, where n is the
number of nodes in the graph. We believe that properly con- 
structed semantic graphs must also have moderately high clus- 
tering coefficients. Low values of clustering coefficient may 
indicate that the linkage information in the semantic graph is 
incomplete. Very high values of clustering coefficient may 
also indicate a poorly constructed semantic graph where all 
the nodes are very highly linked to each other (the limit is a 
fully connected graph), indicating little discrimination in how 
the nodes are connected. 

The average path length, $\ell$, in a semantic graph must also
not be too small (which is also associated with very high clus- 
tering coefficients). When the average path length is small, 
almost all nodes are approximately the same graph distance 
from each other, giving little discriminatory ability to path- 
length based algorithms for detecting relationships. 

For example, an ontology graph may contain a node (e.g., a 
node of type "provenance") to which every other node in the 
ontology is linked. In this case, the maximum shortest path 
length in the ontology graph is 2, which also suggests
that the average path length in the semantic graph is small. It 
may be useful to identify nodes or links in the ontology graph 
that dramatically shorten the average path length. These nodes 
and links are potentially not useful for relationship detection. 
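
One way to look for such path-shortening nodes is to compare the average shortest path length with and without a candidate node. The following is a minimal sketch using breadth-first search; the function name, the `skip` parameter, and the toy graph with a "provenance" hub are assumptions made for illustration. The average is taken over reachable ordered pairs only.

```python
from collections import deque


def average_path_length(adj, skip=frozenset()):
    """Average shortest path length over reachable ordered node pairs.

    adj maps each node to a set of neighbors (undirected, assumed layout).
    Nodes listed in skip are ignored, as if removed from the graph.
    """
    total = 0
    pairs = 0
    for source in adj:
        if source in skip:
            continue
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in skip and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else float("inf")


# Toy graph: a chain of five nodes plus a hub linked to every node.
chain = ["a", "b", "c", "d", "e"]
adj = {n: set() for n in chain + ["provenance"]}
for u, v in zip(chain, chain[1:]):
    adj[u].add(v)
    adj[v].add(u)
for n in chain:
    adj[n].add("provenance")
    adj["provenance"].add(n)

with_hub = average_path_length(adj)
without_hub = average_path_length(adj, skip={"provenance"})
print(with_hub, without_hub)
```

In this toy case the hub reduces the average path length from 2.0 to 1.4; a large drop of this kind is the signal described above for nodes that may not be useful for relationship detection.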

The connectivity distribution P(k) is of interest for seman- 
tic graphs, particularly the existence of nodes with very high 
degree, as in the case of scale-free networks. In a relationship
detection path search, paths through very high degree
nodes are deemed less informative [9]. For example, in a social
network, two people who know a popular person are less
likely to know each other; the linkages to the popular per- 
son should be disregarded in the relationship detection search 
since they may confer erroneous relationships. 

It is believed that power-law connectivity distributions arise
when there is little or no cost involved in the formation of links
in the network [2]. Without this property, no nodes would be
able to acquire a very large number of links. This may suggest
that a graph with power-law degree distribution may contain
many weak linkages. However, these weak linkages cannot be
disregarded; cf. the strength of weak ties, mentioned above.

For semantic graphs, we showed above how to extend the 
concept of clustering coefficient. In the next subsections, we
extend other concepts to semantic
graphs.



A. Extension of node degree 

Even in the simple case of connectivity, a given value k of
the connectivity of a node of type $\alpha$ has no real meaning for
semantic graphs. Indeed, as shown in Figure 3, the topological
connectivity in both cases is k = 4, but its meaning is
very different in each case.

In the first case, the environment is very homogeneous 





FIG. 3: Two examples for which the $\alpha$-type node has topological
connectivity k = 4 but with a different meaning in each case;
cf. Jensen & Neville (2002).



while it is not in the second case. Another complexity comes 
from the fact that the number of $\beta$-type nodes can be very
large, thus inducing a bias in the connectivity of the other
nodes. 

The ontology implies that each node of type $\alpha$ can be connected
to a certain number, $\kappa_\alpha$, of other types. In the semantic
graph, we have a total number of nodes $n = \sum_\alpha n_\alpha$, and
we denote the nodes by $i = 1, \ldots, n$. The type of a node is
given by the function $t(i)$. We denote by $k_{\alpha\beta}(i)$ the number
of neighbors of type $\beta$ of a node i of type $\alpha$. The usual
topological connectivity of the node i (which is of type $\alpha$) is then
given by

$$k_\alpha(i) = \sum_\beta k_{\alpha\beta}(i). \qquad (4)$$

Using this quantity, we can define the average connectivity of
type $\alpha$, which is just the average over all nodes with type $\alpha$:

$$\bar{k}_\alpha = \frac{1}{n_\alpha} \sum_{i,\, t(i)=\alpha} k_\alpha(i). \qquad (5)$$



If we want to compare the different types relative to their
connectivity, it is important to remember that some types can
be connected to many others (such as persons, which can be
linked to other persons, cities, meetings, jobs, etc.) while
other types are only linked to one type (such as a conference,
which takes place only at one location). In order to compare
the different types we thus have to rescale the average
connectivity $\bar{k}_\alpha$ of type $\alpha$ by the number $\kappa_\alpha$ of
different neighbor types it can have according to the ontology:

$$m_\alpha = \frac{\bar{k}_\alpha}{\kappa_\alpha}. \qquad (6)$$



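This quantity indicates the average number of neighbors per
type. It does not, however, tell us if there are large
connectivity fluctuations or if, in contrast, all nodes of a given
type have essentially the same connectivity. We thus have to
measure the connectivity variance per type, which is calculated
using the second moment

$$\overline{k_\alpha^2} = \frac{1}{n_\alpha} \sum_{i,\, t(i)=\alpha} k_\alpha(i)^2 \qquad (7)$$

with the dispersion per type given by

$$\sigma_m(\alpha) = \frac{1}{\kappa_\alpha} \sqrt{\overline{k_\alpha^2} - \bar{k}_\alpha^2}, \qquad (8)$$

where $\bar{k}_\alpha$ is the average connectivity of type $\alpha$ and $\kappa_\alpha$ is the
number of neighbor types allowed for type $\alpha$ by the ontology.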
Another possible way to characterize the connectivity
per type is to plot its full distribution. However,
the dispersion around the average is already a first indication
of the nature of the connections for different types. For
some types, the fluctuations will be small, while for others the
connectivity can fluctuate greatly (such as the number of persons
a person knows).
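
The per-type connectivity statistics of this subsection can be sketched as follows. This is a minimal illustration, not the authors' code: the dictionaries `adj`, `node_type`, and `kappa` (the number of neighbor types allowed by the ontology for each type) are hypothetical inputs, and the dispersion is assumed here to be the standard deviation of the node degree within a type, rescaled by the same number of allowed neighbor types as the average.

```python
from collections import defaultdict
from math import sqrt


def connectivity_per_type(adj, node_type, kappa):
    """Return {type: (average connectivity per type, its dispersion)}.

    adj: node -> set of neighbors; node_type: node -> type name;
    kappa: type name -> number of neighbor types allowed by the ontology.
    """
    degrees = defaultdict(list)
    for i, neighbors in adj.items():
        degrees[node_type[i]].append(len(neighbors))
    stats = {}
    for t, ks in degrees.items():
        mean = sum(ks) / len(ks)                     # average degree of type t
        second = sum(k * k for k in ks) / len(ks)    # second moment
        variance = max(second - mean * mean, 0.0)    # guard against rounding
        stats[t] = (mean / kappa[t], sqrt(variance) / kappa[t])
    return stats


# Illustrative data only.
adj = {
    "p1": {"p2", "c1", "m1"},
    "p2": {"p1"},
    "c1": {"p1"},
    "m1": {"p1"},
}
node_type = {"p1": "person", "p2": "person", "c1": "city", "m1": "meeting"}
kappa = {"person": 3, "city": 1, "meeting": 1}
stats = connectivity_per_type(adj, node_type, kappa)
print(stats)
```

In this example the two persons have degrees 3 and 1, giving an average of 2 and a standard deviation of 1; after rescaling by the three allowed neighbor types, the values are 2/3 and 1/3.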



B. Disparity of connected types 

The above quantities tell us the expected number of con- 
nections of a node of a given type to another type but not the 
correlations between different types. Indeed, a type $\alpha$ can
preferentially link to a type $\beta$ even though it could in principle
also be linked to other types (as given by the ontology).

We thus quantify the disparity (or affinity) of each type to
link to other types. In order to do this we use a convenient
quantity, denoted by $Y_2$, which was introduced in another
context. In order to understand the meaning of this
quantity, let us consider an object that is broken into a number
N of parts, each part having a weight $w_i$. By construction
$\sum_i w_i = 1$, and $Y_2$ is given in this case by

$$Y_2 = \sum_{i=1}^{N} w_i^2. \qquad (9)$$

If all parts have the same weight $w_i \sim 1/N$, then $Y_2 \sim 1/N$
is small (for large N). In contrast, if we have $w_1 = 1/2$ and
the rest are small, implying $w_{i \neq 1} \sim 1/[2(N-1)]$, then we obtain
$Y_2 \sim 1/4$. This simple example can be easily generalized to
more complicated situations and shows that a small value of
$Y_2$ indicates a large number of relevant parts while a larger
value (typically of order $1/m$, where m is of order unity) indicates
the dominance of a few parts.
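
The two limiting cases described above are easy to check numerically. The short Python sketch below computes the sum of squared normalized weights; the function name and the choice N = 100 are illustrative assumptions.

```python
def disparity(weights):
    """Sum of squared weights after normalizing them to sum to 1."""
    total = sum(weights)
    return sum((w / total) ** 2 for w in weights)


N = 100
# Case 1: all parts have the same weight 1/N.
uniform = disparity([1.0] * N)
# Case 2: one part carries half the weight, the rest share the other half.
dominated = disparity([0.5] + [0.5 / (N - 1)] * (N - 1))
print(uniform, dominated)
```

With N = 100 the uniform case gives a value of 1/N = 0.01, while the dominated case gives 1/4 + 1/(4(N - 1)), which is close to 1/4.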

We now apply this idea to the number of types to quantify
the disparity of a node or the affinity of a type. The quantity
$Y_2$ is first defined for a given node i of type $\alpha$:

$$Y_2(i;\alpha) = \sum_{\beta} \left[ \frac{k_{\alpha\beta}(i)}{k_\alpha(i)} \right]^2. \qquad (10)$$



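In order to get results with statistical significance, we average
this quantity over all nodes of the same type and we also
compute its dispersion $\sigma_Y(\alpha)$:

$$Y_2(\alpha) = \frac{1}{n_\alpha} \sum_{i,\, t(i)=\alpha} Y_2(i;\alpha) \qquad (11)$$

$$\sigma_Y(\alpha) = \left[ \overline{Y_2^2}(\alpha) - \left( Y_2(\alpha) \right)^2 \right]^{1/2} \qquad (12)$$

where $\overline{Y_2^2}(\alpha)$ denotes the average of $Y_2(i;\alpha)^2$ over the same nodes.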



These results must however be weighted by the fact that
some types are more numerous than others, which could be a
reason why they appear more often than others. For a given
node type $\alpha$, we denote by $V(\alpha)$ the set of types which can be
connected to $\alpha$ as given by the ontology. If a node has k neighbors,
and if these neighbors are picked at random in the set of
nodes of these types, with populations $n_\beta$, we then obtain a
disparity given by

$$Y_2^r(\alpha) = \sum_{\beta \in V(\alpha)} \left[ \frac{n_\beta}{\sum_{\gamma \in V(\alpha)} n_\gamma} \right]^2. \qquad (13)$$



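Again, this quantity will be very small if all types are uniformly
present in the semantic graph, $Y_2^r \sim 1/N$ (where N is
the total number of different types), and if it is of order unity
then essentially a few types are over-represented. In order to
take these heterogeneities into account it is thus necessary to
rescale $Y_2(\alpha)$ by the random-model value $Y_2^r(\alpha)$ and to form
the factor

$$R(\alpha) = \frac{Y_2(\alpha)}{Y_2^r(\alpha)} \qquad (14)$$

and its corresponding dispersion,

$$\sigma_R(\alpha) = \frac{\sigma_Y(\alpha)}{Y_2^r(\alpha)}, \qquad (15)$$

where $\sigma_Y(\alpha)$ is the dispersion of $Y_2(i;\alpha)$ over the nodes of type $\alpha$.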



A large value (larger than one) of $R(\alpha)$ indicates that type
$\alpha$ preferentially links to a small number of types and that its
neighbor types $V(\alpha)$ are diverse in number. If $R \ll 1$, the
type $\alpha$ may still be preferentially connected to a small set of
types but the diversity of the numbers of each neighbor type
is small.

The dispersion $\sigma_R(\alpha)$ indicates whether the behavior as described
by the average value $R(\alpha)$ is typical, or if in contrast
there is large diversity among the nodes of type $\alpha$.
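
The rescaled disparity can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' implementation: the per-node disparity is the sum of squared fractions of neighbors of each type, the random baseline assumes neighbors are drawn in proportion to the population of each allowed type, and the input dictionaries (`adj`, `node_type`, `type_population`, `allowed`) are hypothetical. Nodes without neighbors are skipped when averaging.

```python
from collections import Counter


def node_disparity(adj, node_type, i):
    """Sum over neighbor types of the squared fraction of i's neighbors."""
    counts = Counter(node_type[j] for j in adj[i])
    k = sum(counts.values())
    return sum((c / k) ** 2 for c in counts.values()) if k else 0.0


def random_disparity(type_population, allowed_types):
    """Disparity expected if neighbors are drawn by type population."""
    total = sum(type_population[b] for b in allowed_types)
    return sum((type_population[b] / total) ** 2 for b in allowed_types)


def disparity_ratio(adj, node_type, type_population, allowed, alpha):
    """Average node disparity of type alpha divided by the random baseline."""
    nodes = [i for i in adj if node_type[i] == alpha and adj[i]]
    if not nodes:
        return None
    mean_y2 = sum(node_disparity(adj, node_type, i) for i in nodes) / len(nodes)
    return mean_y2 / random_disparity(type_population, allowed[alpha])


# Illustrative data only: persons may link to cities and movies.
adj = {
    "p1": {"c1", "c2"},
    "p2": {"c1", "m1"},
    "c1": {"p1", "p2"},
    "c2": {"p1"},
    "m1": {"p2"},
    "m2": set(),
}
node_type = {"p1": "person", "p2": "person", "c1": "city",
             "c2": "city", "m1": "movie", "m2": "movie"}
type_population = {"person": 2, "city": 2, "movie": 2}
allowed = {"person": {"city", "movie"}}
R = disparity_ratio(adj, node_type, type_population, allowed, "person")
print(R)
```

In this toy case the average disparity of persons is 0.75 and the random baseline is 0.5, so the ratio is 1.5, indicating a mild preference for one neighbor type.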

Other usual quantities that are measured in order to characterize
a large network can also be generalized without any
difficulty. For example, degree distributions should be examined
by type of node. In a semantic graph, the overall degree
distribution may not be meaningful, but the degree distribution
for a specific node type may be power-law, etc. As a further
example, the average path length generalizes to become a
matrix $\ell_{\alpha\beta}$, where $\alpha$ indicates the type of the source node of
the shortest paths while $\beta$ is the type of the target node. This
matrix will in general have entries with very different values.



V. SCALE IN SEMANTIC GRAPHS 

Given a knowledge base of relational data, the choice of 
ontology depends on what information needs to be captured in 
the semantic graph, and how easily certain information needs 
to be retrieved. The level of detail (or scale) chosen for the 
ontology (choice of node and link types) will have a direct 
impact on the properties of the corresponding semantic graph. 

In the simplest ontology, we have nodes of only one type. 
In the example of the movies database, this ontology is a sim- 
ple network of actors without any types and two actors are 
connected if they played in the same movie. At the next finer 
scale, we have actors and movies as node types. In this case, 
an actor is connected to a movie if he or she played in
that movie. This is a special case of a semantic graph, namely
a bipartite network (two types of nodes, with links only between
the two types). Coarser models lose some of the information
present in finer models but can be useful for large-scale
computations, such as multi-level search techniques.

At the finest scale of a terrorist network, we may have nodes 
of type "Religious Terrorist Organization" and "Political Ter- 
rorist Organization." A coarser model may aggregate nodes of 
these two types into a new type, "Terrorist Organization" (or 
the aggregation may occur directly if a type hierarchy is avail- 
able). Depending on what information needs to be preserved, 
it may or may not be important to distinguish between these 
two node types at the structural level of the semantic graph. 

We note that in Homeland Security tasks, data analysis 
more often involves searching for outliers rather than com- 
monplace patterns. Thus it is essential that the fine scale data 
is retained and the coarse scale data is used appropriately (for 
example, as an aid in managing and processing large-scale 
data). 

A. Effect of scale on statistical measures 

Here we simply illustrate the effect of scale on the clustering
coefficient. We consider a random bipartite graph with
Poisson distributed numbers of both movies per actor (with
average $\mu$) and actors per movie (with average $\nu$). We suppose
that we have $n_A$ actors and $n_M$ movies. The fact that
each link connects an actor to a movie means that the total
number of links is $\mu n_A = \nu n_M$, which imposes the constraint

$$\frac{\mu}{n_M} = \frac{\nu}{n_A}. \qquad (16)$$

This model can be considered as a "null" model since there
are no particular correlations here. If one computes the clustering
coefficient of the one-mode projection of this network,
one obtains

$$C = \frac{1}{\mu + 1}. \qquad (17)$$

This quantity is finite even in the limit of very large networks,
$n_A, n_M \to \infty$. This is in contrast with the usual random network,
for which

$$C \sim \frac{1}{n} \qquad (18)$$

where n is the number of nodes. At this stage one might conclude
that the actor network is very clustered and different from a
random network with no correlations. This is however clearly
an incorrect statement, since the existence of a large clustering
coefficient here is a consequence of the network construction
procedure.
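
The construction effect can be seen on a very small deterministic example. The sketch below projects a bipartite actor-movie graph onto actors and measures the fraction of closed triples (transitivity) in the projection. It is an illustration only: the function names and the two-movie cast list are assumptions, and the example does not reproduce the Poisson model or the value in Equation (17).

```python
from itertools import combinations


def project_onto_actors(cast):
    """One-mode projection: actors are linked if they share a movie.

    cast maps each movie to the set of actors who played in it.
    """
    adj = {}
    for actors in cast.values():
        for a in actors:
            adj.setdefault(a, set())
        for a, b in combinations(actors, 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def transitivity(adj):
    """Fraction of connected triples (paths of length 2) that are closed."""
    closed = 0
    triples = 0
    for center, neighbors in adj.items():
        for u, v in combinations(neighbors, 2):
            triples += 1
            if v in adj[u]:
                closed += 1
    return closed / triples if triples else 0.0


# Illustrative cast list: the bipartite graph itself contains no triangles.
cast = {"M1": {"A", "B", "C"}, "M2": {"C", "D"}}
actors = project_onto_actors(cast)
print(transitivity(actors))
```

Every movie with three or more actors becomes a fully connected group in the projection, so here three of the five connected triples are closed (0.6) even though no correlations were put into the data.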

VI. EXAMPLES 

A. Movies data 

The "Movies" test data at the UCI KDD Archive contains 
information about movies, persons (actors, directors, etc.), 




FIG. 4: Movies ontology. 



studios, awards, etc. The data was originally compiled by 
Gio Wiederhold (Stanford University). We used this data to 
construct an ontology and semantic graph to express most of 
the information in the dataset. Figure 4 shows the ontology
graph that we developed. In the figure, the meaning of most of 
the links is obvious. However, the person-person link implies 
married-to, lived-with, or some other non-professional rela- 
tionship; the person-studio link implies founded; the movie- 
movie link implies sequel-to. We note that the data is very 
incomplete. 

In this ontology, the best meaning of the node Role is un- 
clear. For example, are two actors linked to the same Role 
node in the semantic graph if they played the role of Villain 
in two different movies? Alternatively, a role node in the se- 
mantic graph may only link to actors playing a given role in a 
single movie. We arbitrarily chose the former in our case. 

A related question, which is structurally similar but seman- 
tically different is the following. Should two actors who win 
a Best Actor award be linked to the same Award node in the 
semantic graph? In this case we did not choose this interpreta- 
tion since it seems that awards are individual entities, whereas 
roles are not. 

Table I summarizes the node types, frequencies, and other
statistical measures for the movies semantic graph. The re- 
sults show high dispersion of average connectivity per type, 
for all types. Further, the disparity of connected types is not 
particularly different from a random model. These indicate a 
relatively well-constructed semantic graph; there are no par- 
ticular correlations (given the numbers of each node type) and 
thus the information content in the graph is high. The results 
will be very different for the terrorism data. 

In the semantic graph, the nodes with the largest clustering 
coefficients depend on whether the types of the nodes are considered.
In the standard case where the types are not considered,
the node Maurice Barrymore has a high clustering coefficient;
the node is connected to Georgiana Drew Barrymore,
Lionel Barrymore, Ethel Barrymore, etc., all of which are con- 
nected to each other. If node types are considered, then it is 
not important that neighbors of a node are not linked if they 
are not permitted to be linked according to the ontology. Now 
nodes that were missed with the above measure may have high 
clustering coefficient, e.g., the movie Dogma (perhaps due to 
the idiosyncrasies of the incomplete data). 
     Node Type      $n_\alpha$     $m_\alpha$      $\sigma_m(\alpha)$    $R(\alpha)$    $\sigma_R(\alpha)$
1    Person         21504     0.872      2.383      1.836    0.663
2    Movie          11540     1.131      0.816      1.299    0.644
3    Award          6734      2.579      10.201     0.905    0.144
4    Country        19        222.509    582.572    1.812    0.364
5    Studio         1075      1.948      9.534      1.241    0.408
6    Genre          39        77.803     160.060    0.512    0.154
7    Role           115       25.561     64.164     0.924    0.028
8    Distributor    16        206.156    356.043    0.782    0.165

TABLE I: Node types and statistics for the movies data: frequency of
node type $n_\alpha$, average connectivity per type $m_\alpha$ and its dispersion
$\sigma_m(\alpha)$, disparity of connected types $R(\alpha)$ and its dispersion
$\sigma_R(\alpha)$. The results show high dispersion of average connectivity
per type, for all types. Further, the disparity of connected types is
not particularly different from a random model.



In the semantic graph, the link between Columbia Pictures
and drama (genre) has the largest number of common neighbors
(710). However, when the link relevance measure (Equation
(2)) is used, which accounts for the number of links a node
has, the link between Bud Abbott and Lou Costello is found
(30 common neighbors). (We also found re-releases of movies 
under a new name in this process.) Further, a semantic version 
of relevance can be defined, which considers only the links 
that are allowed by the semantic graph. In this case, the link 
between Tokuma Studio and docu-drama is found. (Tokuma 
is linked to drama and the movie Carences; docu-drama is 
linked to Carences and Miramax; and Miramax is linked to 
drama.) 
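
The contrast between a raw common-neighbor count and a degree-aware relevance can be illustrated with a small sketch. The normalization used here (shared neighbors divided by all distinct neighbors of the two endpoints) is only an assumed stand-in chosen for illustration; it is not the relevance equation of this paper. The graph and all node names are hypothetical.

```python
from collections import defaultdict

def build_graph(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def common_neighbors(u, v, adj):
    """Nodes linked to both endpoints of the link (u, v)."""
    return (adj[u] & adj[v]) - {u, v}

def normalized_relevance(u, v, adj):
    """Stand-in relevance: shared neighbors over all distinct neighbors."""
    shared = common_neighbors(u, v, adj)
    reachable = (adj[u] | adj[v]) - {u, v}
    return len(shared) / len(reachable) if reachable else 0.0

# Two hubs sharing 3 of their many neighbors, and a pair A-B sharing both of theirs.
edges = [("H1", "H2"), ("A", "B")]
edges += [(h, c) for h in ("H1", "H2") for c in ("c1", "c2", "c3")]
edges += [("H1", f"x{i}") for i in range(5)]
edges += [("H2", f"y{i}") for i in range(4)]
edges += [(p, d) for p in ("A", "B") for d in ("d1", "d2")]
adj = build_graph(edges)

hub_count = len(common_neighbors("H1", "H2", adj))   # 3
pair_count = len(common_neighbors("A", "B", adj))    # 2
hub_rel = normalized_relevance("H1", "H2", adj)      # 3 / 12 = 0.25
pair_rel = normalized_relevance("A", "B", adj)       # 2 / 2 = 1.0
```

The hub link wins on the raw count, but the low-degree pair wins once the number of links per node is taken into account. This is the same kind of reordering as between the Columbia Pictures and drama link and the Abbott and Costello link.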

We also computed the average relevance per link type for 
the semantic graph. First, the least frequent link types 
were Person-/oM/iafec/-Studio and Studio-located-in-Country. 
However, the link types with the lowest average relevance 
were Movie-shot-in-Country and Award-awarded-in-Country. 
As mentioned, these latter links may be the least useful for 
automatic relationship detection. 
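
Averaging relevance per link type amounts to a simple group-by, sketched below. The relevance values are invented purely for illustration, and the link type name Person-acted-in-Movie is hypothetical. The only point is that the least frequent link type need not be the one with the lowest average relevance.

```python
from collections import defaultdict

def per_type_stats(links):
    """Map each link type to (frequency, average relevance).

    `links` is an iterable of (link_type, relevance) pairs.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for link_type, rel in links:
        totals[link_type][0] += 1
        totals[link_type][1] += rel
    return {t: (n, s / n) for t, (n, s) in totals.items()}

# Invented relevance values, for illustration only.
links = [
    ("Person-acted-in-Movie", 0.5),
    ("Person-acted-in-Movie", 0.25),
    ("Person-acted-in-Movie", 0.75),
    ("Movie-shot-in-Country", 0.0),
    ("Movie-shot-in-Country", 0.125),
    ("Studio-located-in-Country", 0.25),
]

stats = per_type_stats(links)
rarest = min(stats, key=lambda t: stats[t][0])          # fewest links
least_relevant = min(stats, key=lambda t: stats[t][1])  # lowest average relevance
```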

B. Terrorism data 

Relational data about world-wide terrorist events is 
available l2(ill, as are ontologies describing the organization 
of this data [16]. From this data we constructed an ontology 
and a semantic graph. The 59 node types are shown in 
Table II. The ontology is shown in Figure 5 as an adjacency 
matrix. The semantic graph contains 2366 nodes. 

Figures 6 and 7 plot the average number of neighbors per 
type and the disparity of connected types, respectively. Error 
bars show the dispersion of each quantity. We consider 
frequencies of 50 or more in this data set to be statistically 
significant, and thus restrict attention to types 1, 2, 3, 28, 
30, 31, 32, 36, 37, 42, and 50. For all these types, the average 
number of neighbors per type is small. The types, however, 
can be separated by their disparity. Types 1, 2, 3, 28, and 50 
have high disparity, i.e., they are connected to many different 
types. This is consistent with nodes of types 1, 2, and 3 being 
of type "location," nodes of type 28 being of type "terrorist 



 #   Type                  n_a      #   Type                  n_a 
 1   Nation                92      31   Shooting              445 
 2   GeographicalRegion    85      32   Bombing               323 
 3   City                  555     33   HostageTaking         14 
 4   Building              10      34   IncendDeviceAttack    18 
 5   Combustion            o       35   Lynching              3 
 6   Destruction           o       36   SuicideBombing        107 
 7   Device                Q       37   CarBombing            114 
 8   GeographicArea        3       38   Arson                 15 
 9   Government            I       39   HandgrenadeAttack     38 
10   GovernmentPerson      2       40   Hijacking             15 
11   Group                 \       41   RocketMissileAttack   14 
12   Hole                  \       42   KnifeAttack           53 
13   Human                 5       43   ChemicalAttack        9 
14   JoiningAnOrg          o       44   LetterBombAttack      10 
15   Killing                       45   Stoning               3 
16   OccupationalRole      n,      46   VehicleAttack         7 
17   Region                o       47   MortarAttack          g 
18   SocialRole                    48   Vandalism             4 
19   StationaryArtifact            49   Other                 5 
20   UnilateralGetting     o       50   Number                120 
21   Vehicle                       51   Continent             2 
22   ViolentContest                52   GeneralStructure 
23   Weapon                o       53   Month                 12 
24   Proposition           o       54   GeneralBuilding       2 
25   BinaryPredicate       o       55   GeneralHuman          2 
26   ForeignTerrOrg        28      56   Airbase               2 
27   ReligiousOrg                  57   Airport               3 
28   TerroristOrg          53      58   State                 4 
29   Infiltration          8       59   Railway               1 
30   Kidnapping            155 

TABLE II: Node types and their frequencies, n_a, for the terrorism 
data. 




FIG. 5: Adjacency matrix for the terrorism ontology. The matrix is 
used to determine which node types are allowed to link to a given 
type. 



organization," and nodes of type 50 being of type "number." 
The remaining types are types of attacks and are not particularly 
correlated with any other node types (given the numbers 
of each node type). We note in this case that semantically 
similar node types have similar values of m_a and R(a). 
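
The selection of statistically significant types can be reproduced directly from the frequencies in Table II. The sketch below is only a convenience check. The dictionary holds the entries that are legible in the table; illegible entries are omitted rather than guessed, and the names `n` and `THRESHOLD` are introduced here for illustration.

```python
# Frequencies n_a from Table II (type index -> count).
# Entries that are illegible in the table are omitted, not guessed.
n = {
    1: 92, 2: 85, 3: 555, 4: 10, 8: 3, 10: 2, 13: 5,
    26: 28, 28: 53, 29: 8, 30: 155, 31: 445, 32: 323,
    33: 14, 34: 18, 35: 3, 36: 107, 37: 114, 38: 15,
    39: 38, 40: 15, 41: 14, 42: 53, 43: 9, 44: 10,
    45: 3, 46: 7, 48: 4, 49: 5, 50: 120, 51: 2,
    53: 12, 54: 2, 55: 2, 56: 2, 57: 3, 58: 4, 59: 1,
}

THRESHOLD = 50  # minimum frequency treated as statistically significant

significant = sorted(t for t, count in n.items() if count >= THRESHOLD)
print(significant)
# prints [1, 2, 3, 28, 30, 31, 32, 36, 37, 42, 50]
```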

FIG. 6: Terrorism data: average number of neighbors per type, m_a, 
plotted against node type. Each error bar is of length σ_m on each 
side of the average. 

FIG. 7: Terrorism data: disparity of connected types, R(a), plotted 
against node type. Each error bar is of length σ_R on each side. 



VII. CONCLUSION 

This paper reveals some of the knowledge representation 
issues associated with semantic graphs. Ideas from the field 
of complex networks have been applied and generalized to 
semantic graphs. For example, transitivity may be used to de- 
termine the relevance of edge types for relationship detection. 

We have defined several measures for statistically 
characterizing node types. These quantities take into account 
the ontology, which specifies the permitted connections in the 
semantic graph. Many other important measures can be defined, 
such as correlations with attribute values [11], which were not 
covered in this paper. These and other tools can help in 
designing ontologies and semantic graphs for knowledge 
representation. 